Abstract


In this paper we present a slight modification of Wilson's Edited
Nearest Neighbor Rule [1] in the one dimensional case for which it is possible
to compute tight bounds on the average asymptotic risk. It is pointed out
that the argument used by Wilson to establish his bounds is probably incorrect,
with the bounds being somewhat optimistic. The rule presented here is not
in itself of any great significance since it does not generalize to more than
one dimension. The contribution lies in the fact that for this type of rule
(which is very similar to Wilson's rule) an exact analysis is possible,
which permits comparison of the relative merits of various editing schemes.
Although no proof is offered, the strong similarities involved give reason
to believe that the results concerning the relative efficiencies of the various
editing schemes will carry over to higher dimensional problems with the
usual version of the nearest neighbor rule.


Another  Look  at  the  Edited  Nearest  Neighbor  Rule 


Nearest neighbor rules form a widely known class of solutions to
problems in the field of pattern discrimination. Several papers concerned
with the various properties exhibited by these rules have appeared in the
literature, most of them directing their attention toward asymptotic properties
of the risk when the rule is used with a data collection of independent
classified observations. Another interesting question is the following. Given
data consisting of n classified observations, it will sometimes be the case
that some of the observations from one class will lie in a region where most
of the observations are from another class. In such a case, it may be possible
to improve the rule's performance by removing from the data those
observations which are "surrounded" by observations from a different class.
The question is whether there is an effective way to identify those
observations which should be eliminated.

Wilson [1] has examined this problem and proposed the following
algorithm. Take each sample of the data in turn and, using the k nearest
neighbor rule with the remainder of the data, estimate its classification.
The edited data set is obtained by removing from the original data set those
samples which were misclassified by their k nearest neighbors. The edited
nearest neighbor rule then uses the single nearest neighbor rule in
conjunction with the edited data set to classify unknown observations.
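The editing procedure just described can be sketched in a few lines. This is a minimal illustration, not Wilson's own implementation: the function names, the Euclidean distance, and the use of NumPy are assumptions made here, and k is taken to be odd with two classes so that no voting ties arise.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points nearest to query."""
    dist = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def wilson_edit(x, y, k):
    """Boolean mask of the samples kept by leave-one-out k-NN editing."""
    n = len(x)
    keep = np.zeros(n, dtype=bool)
    for i in range(n):
        others = np.arange(n) != i          # classify sample i with the rest
        keep[i] = knn_predict(x[others], y[others], x[i], k) == y[i]
    return keep

def edited_nn_classify(x, y, query, k):
    """Edit with k neighbors, then classify with the single nearest neighbor."""
    keep = wilson_edit(x, y, k)
    return knn_predict(x[keep], y[keep], query, 1)

# One class 2 sample (at 2.5) sits inside the class 1 region.
x = np.array([[0.0], [1.0], [2.0], [2.5], [10.0], [11.0], [12.0]])
y = np.array([1, 1, 1, 2, 2, 2, 2])
print(wilson_edit(x, y, 3))
print(edited_nn_classify(x, y, np.array([2.6]), 3))
```

In this toy data set the isolated class 2 sample is removed, so a query at 2.6 is assigned to class 1 by the edited rule although its nearest neighbor in the unedited data is from class 2.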

Wilson used an argument to show that

\[ E\,L_n(k) \to R^E(k) \]

where $L_n(k)$ is the conditional $n$ sample probability of error for the edited
nearest neighbor rule which uses $k$ neighbors in the editing process, and
where $R^E(k)$ satisfies

\[ R^* \le R^E(1) \le 1.2R^* \]

\[ R^* \le R^E(3) \le 1.149R^* \]

\[ R^* \le R^E(5) \le 1.10R^* . \]


Unfortunately, the argument is incomplete. On page 413 of [1], Wilson
gives an expression for $\varphi^{km}_{\infty}(1/x)$ which is claimed to be the asymptotic
probability that an observation at $x$ is assigned to class 1. Actually
$\varphi^{km}_{\infty}(1/x)$ is just the proportion of samples from class 1 in a small
neighborhood of $x$ after editing. Unless those samples are uniformly
distributed in the neighborhood, $\varphi^{km}_{\infty}(1/x)$ is not necessarily related to the
probability that $x$ is assigned to class 1. Wilson does not indicate any
reason why the samples should be uniformly distributed and in fact intuition
seems to indicate that the editing process leaves the samples distributed in
clusters. (It should be noted that Tomek [2] makes the same error in arriving
at his equation 13.)

An exact analysis of the effect of editing on the average asymptotic
performance can be done if we restrict the problem to one dimension and
use a rule which selects the nearest neighbor to a point x from those samples
which are greater than x. It is important to point out two things about this
type of nearest neighbor rule. First, the arguments used by Cover and Hart
[3] are still applicable so that these rules' asymptotic performance will be
indistinguishable from that of the standard nearest neighbor rules. Second,
Wilson's argument still applies to the edited version of these rules so that
the same bounds arrived at in his paper would still apply, if his argument
were correct. In fact, however, the analysis shows his bounds to be
optimistic in this case, leaving little justification for thinking them correct
in the other, more important case.
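The one-sided rule and its edited version can be sketched as follows. This is an illustrative reading of the rule, not code from the paper: the function names are hypothetical, the samples are assumed to be sorted in increasing order with two classes and odd k, and the last k samples, which lack a full set of right-hand neighbors, are simply dropped (an assumption that does not matter asymptotically).

```python
from bisect import bisect_right

def one_sided_edit(xs, ys, k):
    """Keep sample i when the majority label among the k samples
    immediately to its right equals ys[i]. xs must be sorted ascending."""
    kept_x, kept_y = [], []
    for i in range(len(xs) - k):            # assumption: drop the last k samples
        window = ys[i + 1:i + 1 + k]
        majority = max(set(window), key=window.count)
        if majority == ys[i]:
            kept_x.append(xs[i])
            kept_y.append(ys[i])
    return kept_x, kept_y

def one_sided_classify(kept_x, kept_y, x):
    """Label of the nearest kept sample strictly greater than x, else None."""
    pos = bisect_right(kept_x, x)
    return kept_y[pos] if pos < len(kept_x) else None

xs = [1, 2, 3, 4, 5, 6, 7]
ys = [1, 2, 1, 1, 2, 2, 2]
kept_x, kept_y = one_sided_edit(xs, ys, 1)
print(kept_x, kept_y)
print(one_sided_classify(kept_x, kept_y, 0.5))
```

With k = 1 a sample survives exactly when its right-hand neighbor carries the same label, so the classification of a point depends only on the order of the labels to its right.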

As usual, we let $(X_1, \theta_1), \ldots, (X_n, \theta_n)$ be a sequence of independent
identically distributed random vectors where each observation $X_j$ takes
values in $\mathbb{R}$ and each label $\theta_j$ takes values in $\{1, 2\}$. For each $j$,

\[ P\{\theta_j = 1\} = \pi_1 , \]

\[ P\{\theta_j = 2\} = 1 - \pi_1 = \pi_2 , \]

and $P\{X_j \le x \mid \theta_j = i\}$ has an almost everywhere continuous density
$f_i$, $i = 1, 2$.

The following two lemmas will be used in the calculation of $\varphi(1/x)$,
the asymptotic probability that the nearest neighbor to $x$ after editing is
from class 1. $\varphi(1/x)$ will then be used to bound $R^E(k)$ in terms of $R^*$.

Let $x$ be a continuity point of $f_1$ and $f_2$, and

\[ p_1(x) = P\{\theta = 1 \mid X = x\} = \frac{\pi_1 f_1(x)}{\pi_1 f_1(x) + \pi_2 f_2(x)} , \]

\[ p_2(x) = 1 - p_1(x) . \]

We will use $\theta^{k}$ to denote the label associated with the $k$th nearest neighbor
to $x$ (from those samples greater than $x$). Finally, if $\{s_i\}_{i=1}^{j}$ is a sequence
of ones and twos of length $j$, we will let $S_j$ denote the event that

\[ \{\theta^{1} = s_1, \ldots, \theta^{j} = s_j\} . \]

The dependence of each event $S_j$ on $n$ is implicit here.

Lemma. Let $S_j$ be an event as described above, where the corresponding
sequence contains $m$ ones and $j-m$ twos. Then

\[ \lim_{n\to\infty} P(S_j) = p_1(x)^{m}\, p_2(x)^{j-m} . \]

Proof. The proof is a simple application of well known theorems concerning
the convergence of the nearest neighbors to $x$. (See Cover and Hart [3],
Wagner [4].)
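As a concrete instance of this lemma (an illustration with an arbitrarily chosen sequence, not taken from the original text), consider the label sequence 1, 2, 1 of length 3, which contains two ones and one two:

```latex
\[
\lim_{n\to\infty} P\{\theta^{1} = 1,\ \theta^{2} = 2,\ \theta^{3} = 1\}
  = p_1(x)\,p_2(x)\,p_1(x) = p_1(x)^{2}\,p_2(x) .
\]
```

In other words, in the limit the labels of the successive nearest neighbors to the right of x behave like independent draws which are from class 1 with probability p_1(x).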

Now, let $\{S_j\}$ be a sequence of such events, with $S_j$ depending
only on $\theta^{1}, \ldots, \theta^{j}$. We assume that $S_i$ and $S_j$ are disjoint for all $i \ne j$,
and, if $j > n$, then $S_j$ is empty. We also need the following easy lemma.

Lemma. Let $\{S_j\}$ be a sequence of events as described above. Then

\[ \lim_{n\to\infty} P\Bigl(\bigcup_{j=1}^{\infty} S_j\Bigr) = \sum_{j=1}^{\infty} p_1^{m_j}(x)\, p_2^{\,j-m_j}(x) \]

where $m_j$ is the number of ones in the sequence associated with $S_j$.


The computation of $\varphi(1/x)$ is done by specifying the sequences of
labels $\theta^{k}$ which yield the desired result after editing. The lemmas above
are then used to find the limiting probability of obtaining one of the necessary
sequences.

In the case of editing with a single nearest neighbor, if the labels
of the samples to the right of $x$ are in one of the following sequences, then
the nearest sample remaining to the right of $x$ after editing will be from
class 1. The sequences are given as X's and O's, where an X denotes class 1,
and an O denotes class 2, and where, for example, $(XO)^3$ indicates XOXOXO.
The sequences of interest are $(XO)^j XX$, $j \ge 0$ and $(OX)^j X$, $j \ge 1$. By the
use of the above lemmas, we can compute the limiting probability that one of
the above sequences occurs as

\[ \varphi_1(1/x) = \sum_{j=0}^{\infty} p_1^2 (p_1 p_2)^j + \sum_{j=1}^{\infty} p_1 (p_1 p_2)^j
   = \frac{p_1^2}{1 - p_1 p_2} + \frac{p_1^2 p_2}{1 - p_1 p_2}
   = \frac{p_1^2 (1 + p_2)}{1 - p_1 p_2} . \]
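The single neighbor case can be checked numerically. The sketch below sums the probabilities of the two sequence families $(XO)^j XX$ and $(OX)^j X$, compares the result with the closed form $p_1^2(1+p_2)/(1-p_1p_2)$ that the geometric series give, and also estimates the same quantity by simulating independent labels until the first repeated pair appears. The function names, the truncation at 200 terms, and the simulation settings are assumptions made for illustration.

```python
import random

def phi1_closed(p1):
    """Closed form obtained by summing the two geometric series."""
    p2 = 1.0 - p1
    return p1 ** 2 * (1.0 + p2) / (1.0 - p1 * p2)

def phi1_series(p1, terms=200):
    """Truncated sums over (XO)^j XX, j >= 0, and (OX)^j X, j >= 1."""
    p2 = 1.0 - p1
    first = sum(p1 ** 2 * (p1 * p2) ** j for j in range(0, terms))
    second = sum(p1 * (p1 * p2) ** j for j in range(1, terms))
    return first + second

def phi1_simulated(p1, trials=100_000, seed=0):
    """Fraction of i.i.d. label streams whose first repeated pair is XX."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        prev = rng.random() < p1            # True stands for class 1 (X)
        while True:
            cur = rng.random() < p1
            if cur == prev:                 # first sample that survives editing
                hits += prev
                break
            prev = cur
    return hits / trials

p1 = 0.3
print(phi1_closed(p1), phi1_series(p1), phi1_simulated(p1))
```

The simulation relies on the observation that, with one editing neighbor, the first surviving sample is the first one whose right-hand neighbor has the same label.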


For the case of editing with three nearest neighbors, the sequences
of interest are:

XXXX
XXXO
XXOX
XOXX
$XX(OOXX)^j OX$, $j \ge 1$
$X(OOXX)^j X$, $j \ge 1$
$X(OX)^j X$, $j \ge 2$
OXXX
$O(XXOO)^j XXOX$, $j \ge 1$
$O(XO)^j XX$, $j \ge 1$
$O(XXOO)^j XXX$, $j \ge 1$
$OO(XXOO)^j XXOX$, $j \ge 1$
$OO(XXOO)^j XXX$, $j \ge 1$.
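The effect of editing on a given label sequence can be traced mechanically. The sketch below is an illustration under stated assumptions: the function name is hypothetical, a sample is taken to survive when at least (k+1)/2 of the k labels to its right agree with its own, and k is odd. It confirms that each pattern listed above, at its smallest admissible j, leaves a class 1 sample as the first survivor; it does not test whether the list is exhaustive.

```python
def first_kept_label(labels, k):
    """labels: string of 'X' (class 1) and 'O' (class 2), ordered by distance
    to the right of x. Returns the label of the first sample that survives
    k-neighbor editing, or None if the string is too short to decide."""
    need = k // 2 + 1                       # votes required for a majority
    for i, own in enumerate(labels):
        window = labels[i + 1:i + 1 + k]
        same = window.count(own)
        other = len(window) - same
        if same >= need:
            return own                      # sample i is kept
        if other < need:
            return None                     # outcome depends on unseen labels
        # otherwise sample i is edited out; move on to the next one
    return None

listed = [
    "XXXX", "XXXO", "XXOX", "XOXX",
    "XX" + "OOXX" * 1 + "OX",
    "X" + "OOXX" * 1 + "X",
    "X" + "OX" * 2 + "X",
    "OXXX",
    "O" + "XXOO" * 1 + "XXOX",
    "O" + "XO" * 1 + "XX",
    "O" + "XXOO" * 1 + "XXX",
    "OO" + "XXOO" * 1 + "XXOX",
    "OO" + "XXOO" * 1 + "XXX",
]
for seq in listed:
    print(seq, first_kept_label(seq, 3))
```

The same function with k = 1 reproduces the single neighbor case, where the first survivor is the first sample followed by a sample with the same label.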




The  limiting  probability  of  obtaining  one  of  the  above  sequences  has  been 
calculated  to  be 

2 

3 P2(1+Pl)  p2(1+P2)tl+P2+PlP2(1+Pl)'1 

_ • 

In  the  case  of  editing  with  five  nearest  neighbors,  the  analysis  was 
done  in  the  same  fashion,  but  it  becomes  rather  tedious  so  only  the  result 
will  be  given. 

„ , , 6 .5  ,_  4 2 3 3 

<p5(l/x)  = Px  + 6pxp2  + 15pxp2  + lOp^ 


x Pl?2  ; 2 \ 

+ ^ (P1  ' ?2) 


1"(P1P2) 


3 3(P1-P2)(P1P2+P^P2) 


+ (P*-P2)  (5  + 2pip2  + 4pip2)  + ^pJ-P^1  ¥ p!p2^ 


, 4 4n 

(pi  - P2) 


We also include the result obtained by editing the data set twice
in succession with one nearest neighbor.


PoU-P,) 


n.i(1/x,  = r^(ltp2+r^r2)  • 

Finally, the computation was done for editing with one nearest
neighbor, then using the three nearest neighbor rule to classify unknowns.
This resulted in


P1  P1(1*P2>  f P1  , P1P2  s2 

d/x)=  5 + +(  ) 1 • 

(1-p  p2)  1 - p p l-PlP2  l-PlP2  U-P1P2) 


1_P1P2 


These last two quantities were computed to gain some idea of the
effectiveness of variations on Wilson's basic idea.

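Finally, to obtain the bounds on $R^E(k)$, we compute $r_E(x)/r_B(x)$,
the ratio of the local risk for the edited rule to the local risk for the Bayes
rule. Note that

\[ r_B(x) = \min\{p_1(x),\ p_2(x)\} , \]

\[ r_E(x) = \varphi(1/x)\,p_2(x) + \varphi(2/x)\,p_1(x)
          = p_1(x) + \varphi(1/x) - 2\varphi(1/x)\,p_1(x) , \]

where the second equality uses $\varphi(2/x) = 1 - \varphi(1/x)$ and $p_2(x) = 1 - p_1(x)$.
For $0 < p_1(x) \le \tfrac{1}{2}$, we have

\[ \frac{r_E(x)}{r_B(x)} = \frac{p_1(x) + \varphi(1/x) - 2\varphi(1/x)\,p_1(x)}{p_1(x)} . \]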

We substitute the appropriate expressions for $\varphi(1/x)$ as computed above and
find the maximum as a function of $p_1$. This yields

\[ R^* \le R^E(1) \le 1.269R^* \]

\[ R^* \le R^E(3) \le 1.204R^* \]

\[ R^* \le R^E(5) \le 1.169R^* . \]
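The first of these constants can be reproduced numerically. The sketch below substitutes the single neighbor expression $\varphi_1(1/x) = p_1^2(1+p_2)/(1-p_1p_2)$, which follows from the sequences $(XO)^j XX$ and $(OX)^j X$, into the risk ratio and maximizes over a grid of $p_1$ values in (0, 1/2]. The grid resolution and the function names are assumptions made for illustration; the three and five neighbor cases are not covered.

```python
import numpy as np

def phi1(p1):
    """Limiting probability that the nearest surviving sample is class 1
    when editing with a single neighbor."""
    p2 = 1.0 - p1
    return p1 ** 2 * (1.0 + p2) / (1.0 - p1 * p2)

def risk_ratio(p1, phi):
    """Local risk of the edited rule divided by the local Bayes risk,
    valid for 0 < p1 <= 1/2 where the Bayes risk equals p1."""
    return (p1 + phi - 2.0 * phi * p1) / p1

grid = np.linspace(1e-6, 0.5, 500_001)
ratios = risk_ratio(grid, phi1(grid))
best = int(np.argmax(ratios))
max_ratio, argmax_p1 = float(ratios[best]), float(grid[best])
print(round(max_ratio, 3), round(argmax_p1, 3))
```

The maximum found this way is close to 1.269 and is attained near $p_1 = 0.25$, in agreement with the bound on $R^E(1)$; at $p_1 = 1/2$ the ratio equals 1.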

For comparison purposes, the bounds for the standard 1, 3, and 5 nearest
neighbor rules are

\[ R^* \le R(1) \le 2R^* \]

\[ R^* \le R(3) \le 1.31R^* \]

\[ R^* \le R(5) \le 1.2R^* . \]

The improvements made possible by editing the data set are obvious, although
not quite as good as originally suggested by Wilson. The result obtained by
editing the data twice with one nearest neighbor is

\[ R^* \le R^{2E}(1) \le 1.162R^* . \]


Finally, editing with one neighbor followed by classifying with three neighbors
yields

\[ R^* \le R^{E}_{3}(1) \le 1.168R^* . \]

The attractiveness of the result obtained by editing twice with one
nearest neighbor is diminished somewhat by the fact that it appears to be
computationally more difficult to edit twice with one neighbor than to edit
once with several neighbors. We note that Tomek [2] errs in stating that the
opposite is true. The reason is that in either case, the distance between the
samples must be computed at least once. However, in the case of editing
twice, one must either store all the distances, or recompute most of them for
the second edit. In the case of editing once with k neighbors, it is necessary
to compute the distances only once, storing only the current k nearest
neighbors as the distances are being computed.
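The storage argument for a single edit with k neighbors can be made concrete. In the sketch below every pairwise distance is computed exactly once and offered to both endpoints, each of which retains only its k closest neighbors in a bounded heap. The function name, the one dimensional absolute-difference distance, and the tie handling are assumptions made for illustration; more than k samples, two classes, and odd k are assumed.

```python
import heapq
from collections import Counter

def edit_once_streaming(points, labels, k):
    """Indices kept by k-NN editing, with each pairwise distance computed
    once and only the k current nearest neighbors stored per sample."""
    n = len(points)
    # heaps[i] holds at most k pairs (-distance, label); the root is the
    # farthest neighbor currently retained for sample i.
    heaps = [[] for _ in range(n)]

    def offer(i, dist, label):
        item = (-dist, label)
        if len(heaps[i]) < k:
            heapq.heappush(heaps[i], item)
        elif item > heaps[i][0]:            # closer than the farthest retained
            heapq.heapreplace(heaps[i], item)

    for i in range(n):
        for j in range(i + 1, n):
            d = abs(points[i] - points[j])  # computed once per pair
            offer(i, d, labels[j])
            offer(j, d, labels[i])

    kept = []
    for i in range(n):
        votes = Counter(label for _, label in heaps[i])
        if votes.most_common(1)[0][0] == labels[i]:
            kept.append(i)
    return kept

points = [0.0, 1.0, 2.0, 2.5, 10.0, 11.0, 12.0]
labels = [1, 1, 1, 2, 2, 2, 2]
print(edit_once_streaming(points, labels, 3))
```

The working storage here grows as n times k rather than as the square of n, whereas a second editing pass would need either the full table of distances or a fresh round of distance computations on the surviving samples.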

One remaining point of interest is the amount of data reduction to be
expected when the data set is edited. Asymptotically, this will depend on
$R(k)$, the asymptotic risk of the $k$ nearest neighbor rule. We note that if we
let $S_n$ be the number of samples edited out of the data, then $S_n/n$ is simply
the deleted estimate of $R(k)$ (see Cover [5]). Wagner [6] has shown that
under mild conditions satisfied here, $S_n/n \to R(k)$ in probability, so
asymptotically the edited data set will contain a fraction near $1 - R(k)$ of the
number of samples before editing. This means that in problems which have
a small value of $R^*$, the amount of data reduction to be expected from
editing is negligible.
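To make the size of the reduction concrete, write $R(k)$ for the asymptotic risk of the $k$ nearest neighbor rule and $S_n$ for the number of samples edited out, and plug in a hypothetical value. The number 0.05 below is an assumption chosen purely for illustration and is not a result of the analysis above.

```latex
\[
R(k) = 0.05 \quad\Longrightarrow\quad \frac{S_n}{n} \approx 0.05 ,
\qquad \frac{n - S_n}{n} \approx 1 - R(k) = 0.95 .
\]
```

Under that assumption roughly 95 percent of the samples would survive editing, so the edited data set would be only slightly smaller than the original.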